Text Analysis

Sentiment Analysis

Words As Data

Words are everywhere. Believe it or not, you are reading words right now! Given our penchant for taking things and making numbers out of them, you are probably already guessing that we can somehow make words tell a story with numbers. If that is what you are guessing, then you are absolutely correct.

Processing Text

Before we can even begin to dive into analyzing text, we must first process the text. Processing text involves several steps that will be combined in various ways, depending on what we are trying to accomplish.

Stemming

Tense aside, are “chewed”, “chew”, and “chewing” the same thing? Yes, but what if we compare the actual strings? As strings, are they the same? No. We have strings with 6, 4, and 7 characters, respectively.

What if we remove the suffixes “ed” and “ing”? We are left with three instances of “chew” – something equivalent both in meaning and as a string. This is the goal of stemming.

Let’s take a look to see how this works (you will need to install tm and SnowballC first):


chewStrings = c("chew", "chewing", "chewed", "chewer")

tm::stemDocument(chewStrings)

[1] "chew"   "chew"   "chew"   "chewer"

We got almost exactly what we expected. You might have noticed that “chewer” did not get stemmed. Do you have any idea why? Let’s think through it together. “Chew”, “chewing”, and “chewed” are all verbs related to the act of chewing. “Chewer”, on the other hand, is a person who chews – it is a noun. Martin Porter’s stemming algorithm works incredibly well!
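Under the hood, stemDocument() calls the Porter stemmer from the SnowballC package, so we can get the same behavior from SnowballC directly (a quick sanity check, assuming the package is installed):

```r
# stemDocument() wraps SnowballC's implementation of Porter's algorithm
SnowballC::wordStem(c("chew", "chewing", "chewed", "chewer"),
                    language = "english")
```

The Porter rules only strip “-er” when the remaining stem is long enough by the algorithm’s measure, which is why “chewer” survives intact here too.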

Hopefully, this makes conceptual sense; however, we also need to understand why we do it. In a great many text-based methods, we are going to create a matrix that keeps track of every term (i.e., word) in every document – this is known as a document-term matrix. If we know that “chew”, “chewing”, and “chewed” all refer to the same thing, we want it represented just once within our document-term matrix.

Shall we take a look?


library(tm)

documents = c("I like to chew", 
              "I have chewed my whole life", 
              "Chewing and stomping through the fields", 
              "I am a jumper")

documentsCorp = tm::SimpleCorpus(VectorSource(documents))

documentsDTM = DocumentTermMatrix(documentsCorp)

inspect(documentsDTM)

<<DocumentTermMatrix (documents: 4, terms: 13)>>
Non-/sparse entries: 13/39
Sparsity           : 75%
Maximal term length: 8
Weighting          : term frequency (tf)
Sample             :
    Terms
Docs and chew chewed chewing fields have life like stomping whole
   1   0    1      0       0      0    0    0    1        0     0
   2   0    0      1       0      0    1    1    0        0     1
   3   1    0      0       1      1    0    0    0        1     0
   4   0    0      0       0      0    0    0    0        0     0

We can see that without stemming, we have 13 terms (very short words like “I”, “a”, and “to” get dropped automatically, since DocumentTermMatrix() ignores words under three characters by default). Let’s do some stemming now:


documentsStemmed = stemDocument(documents)

documentsStemmed

[1] "I like to chew"                  
[2] "I have chew my whole life"       
[3] "Chew and stomp through the field"
[4] "I am a jumper"                   

And now the document-term matrix:


stemmedDocCorp = tm::SimpleCorpus(VectorSource(documentsStemmed))

stemmedDocDTM = DocumentTermMatrix(stemmedDocCorp)

inspect(stemmedDocDTM)

<<DocumentTermMatrix (documents: 4, terms: 11)>>
Non-/sparse entries: 13/31
Sparsity           : 70%
Maximal term length: 7
Weighting          : term frequency (tf)
Sample             :
    Terms
Docs and chew field have life like stomp the through whole
   1   0    1     0    0    0    1     0   0       0     0
   2   0    1     0    1    1    0     0   0       0     1
   3   1    1     1    0    0    0     1   1       1     0
   4   0    0     0    0    0    0     0   0       0     0

If we are trying to find documents that cover similar content or talk about similar things, this document-term matrix will help us draw better conclusions: the first three documents all concern the act of chewing, and the stemmed document-term matrix reflects that.
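To make “find similar documents” concrete, here is a base R sketch of cosine similarity computed from a tiny hand-built term-count matrix (the counts here are illustrative, not pulled from the DTM above):

```r
# Toy document-term counts: rows are documents, columns are stemmed terms
dtm = rbind(doc1 = c(chew = 1, field = 0, jumper = 0),
            doc2 = c(chew = 1, field = 0, jumper = 0),
            doc3 = c(chew = 1, field = 1, jumper = 0),
            doc4 = c(chew = 0, field = 0, jumper = 1))

cosineSimilarity = function(m) {
  normed = m / sqrt(rowSums(m^2))  # scale each row to unit length
  tcrossprod(normed)               # row-by-row dot products = cosines
}

round(cosineSimilarity(dtm), 2)
```

Documents 1 and 2 come out with similarity 1 (identical term use), while document 4 shares no terms with the others and scores 0 against each of them.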

Lemmatization

Stemming is often enough (and most modern stemmers work pretty well on their own). Still, stemming is slightly more akin to amputating an arm with a battle axe – it works, but it is brute force. Lemmatization is a more sophisticated approach. You might have already guessed that lemmatization finds the lemma of a word; if you know a little morphology, you know that the lemma is a word’s canonical form. A group of word forms that express the same idea is called a lexeme (“am”, “be”, and “are” all belong to the same lexeme). Generally, the smallest form of the word is chosen as the lemma. This is a really interesting area of linguistics, but we don’t need to dive fully in.

Instead, let’s see it in action.

If we run our “chew” strings through both stemming and lemmatization, we can compare what we get:


library(textstem)

chewStrings = c("chew", "chewing", "chewed", "chewer")

stem_words(chewStrings)

[1] "chew"   "chew"   "chew"   "chewer"

lemmatize_words(chewStrings)

[1] "chew"   "chew"   "chew"   "chewer"

Absolutely nothing different – both stemming and lemmatizing perform the same task here. The act of jumping comprises past, present, and future tenses, with “jump” as the lemma; “jumper” is still seen as something else entirely.
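The lexeme idea from earlier is where lemmatization really separates itself from stemming: no suffix-stripping rule can turn “am” into “be”, but a dictionary lookup can. Assuming textstem (and its lexicon backend) is installed:

```r
library(textstem)

# Forms of "be" all belong to one lexeme; the lemma lookup maps them back
lemmatize_words(c("am", "is", "are", "was", "were", "being"))
# each form should come back as the lemma "be"
```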

But let’s take a look at something different. If we have a string of the most lovely words, what might happen?


lovelyString = c("lovely", "lovelier", "loveliest")

stem_words(lovelyString)

[1] "love"      "loveli"    "loveliest"

That is about as close to nonsense as we could possibly get without going into Dr. Seuss mode.

But if we try lemmatization:


lemmatize_words(lovelyString)

[1] "lovely" "lovely" "lovely"

We get something that starts to make sense. Now, let’s try these on some actual chunks of text and see what happens.


# This data is in the "data" folder on Sakai!

load("C:/Users/sberry5/Documents/teaching/courses/unstructured/data/allLyricsDF.RData")

sampleLyrics = allLyricsDF[40, ]

sampleLyrics$lyrics

[1] \n          \n            \n            [Verse 1]The preacher man says it's the end of time\nAnd the Mississippi River she's a-goin' dry\nThe interest is up and the stock market's down\nAnd you only get mugged if you go downtownI live back in the woods, you see\nMy woman and the kids and the dogs and meI got a shotgun, a rifle and a 4-wheel driveAnd a country boy can survive, country folks can surviveI can plow a field all day long\nI can catch catfish from dusk 'til dawn\nWe make our own whiskey and our own smoke, too\nAin't too many things these old boys can't do\nWe grow good old tomatoes and homemade wine\nAnd a country boy can survive, country folks can survive\n[Chorus 1]Because you can't starve us out and you can't make us run\n'Cause we're them old boys raised on shotgunsAnd we say grace and we say Ma'am\nIf you ain't into that we don't give a damn\nWe came from the West Virginia coal mines\nAnd the Rocky Mountains and the western skies\nAnd we can skin a buck, we can run a trotline\nAnd a country boy can survive, country folks can survive\n[Verse 2]\nI had a good friend in New York City\nHe never called me by my name, just "hillbilly"\nMy grandpa taught me how to live off the land\nAnd his taught him to be a businessman\nHe used to send me pictures of the Broadway nights\nAnd I'd send him some homemade wine\nBut he was killed by a man with a switchblade knife\nFor 43 dollars my friend lost his lifeI'd love to spit some Beech-Nut in that dude's eyes, and shoot him with my old 45\n'Cause a country boy can survive, country folks can survive\n[Chorus 2]\n'Cause you can't starve us out and you can't make us run\n'Cause we're them old boys raised on shotguns\nAnd we say grace and we say Ma'am\nAnd if you ain't into that we don't give a damnWe're from north California and south Alabama\nAnd little towns all around this land\nAnd we can skin a buck, and run a trotline\nAnd a country boy can survive, country folks can survive\n[Outro]\nA country boy can 
survive\nCountry folks can survive\n\n\n            \n          \n        
3077 Levels: \n          \n            \n            I'll need time to get you off my mind\nAnd I may sometimes bother you\nTry to be in touch with you\nEven ask too much of you from time to time\nNow and then\nLord, you know I'll need a friend\nAnd 'till I get used to losing you\nLet me keep on using you\n'Til I can make it on my own\nI'll get by, but no matter how I try\nThere'll be times that you'll know I'll call\nChances are my tears will fall\nAnd I'll have no pride at all, from time to time\nBut they say, oh, there'll be a brighter day\nBut 'til then I lean on you\nThat's all I mean to do\n'Til I can make it on my own\nSurely someday I'll look up and see the morning sun\nWithout another lonely night behind me\nThen I'll know I'm over you and all my crying's done\nNo more hurtin' memories can find me\nBut 'til then\nLord, You know I'm gonna need a friend\n'Til I get used to losing you\nLet me keep on using you\n'Til I can make it on my own\n'Til I can make it on my own\n\n\n            \n          \n         ...

Of course, we will need to do some cleaning on our text first:


library(dplyr)

library(stringr)

cleanLyrics = sampleLyrics$lyrics %>% 
  str_replace_all(., "\n", " ") %>%                      # newlines to spaces
  str_replace_all(., "\\[[A-Za-z]+\\s*[0-9]*]", "") %>%  # drop [Verse 1]-style tags
  str_squish(.) %>%                                      # collapse repeated whitespace
  gsub("([a-z])([A-Z])", "\\1 \\2", .)                   # split words fused across line breaks

We have to try the obligatory stemming:


stem_strings(cleanLyrics)

[1] "The preacher man sai it' the end of time And the Mississippi River she' a - goin' dry The interest i up and the stock market' down And you onli get mug if you go downtown I live back in the wood, you see My woman and the kid and the dog and me I got a shotgun, a rifl and a 4 - wheel drive And a countri boi can surviv, countri folk can surviv I can plow a field all dai long I can catch catfish from dusk 'til dawn We make our own whiskei and our own smoke, too Ain't too mani thing these old boi can't do We grow good old tomato and homemad wine And a countri boi can surviv, countri folk can surviv Becaus you can't starv u out and you can't make u run 'Caus we'r them old boi rais on shotgun And we sai grace and we sai Ma'am If you ain't into that we don't give a damn We came from the West Virginia coal mine And the Rocki Mountain and the western ski And we can skin a buck, we can run a trotlin And a countri boi can surviv, countri folk can surviv I had a good friend in New York Citi He never call me by my name, just \" hillbilli \" My grandpa taught me how to live off the land And hi taught him to be a businessman He us to send me pictur of the Broadwai night And I'd send him some homemad wine But he wa kill by a man with a switchblad knife For 43 dollar my friend lost hi life I'd love to spit some Beech - Nut in that dude' ey, and shoot him with my old 45 'Caus a countri boi can surviv, countri folk can surviv 'Caus you can't starv u out and you can't make u run 'Caus we'r them old boi rais on shotgun And we sai grace and we sai Ma'am And if you ain't into that we don't give a damn We'r from north California and south Alabama And littl town all around thi land And we can skin a buck, and run a trotlin And a countri boi can surviv, countri folk can surviv A countri boi can surviv Countri folk can surviv"

And now the lemmatized version:


lemmatize_strings(cleanLyrics)

[1] "The preacher man say it's the end of time And the Mississippi River she's a - goin' spin-dry The interest be up and the stock market's down And you only get mug if you go downtown I live back in the wood, you see My woman and the kid and the dog and me I get a shotgun, a rifle and a 4 - wheel drive And a country boy can survive, country folk can survive I can plow a field all day long I can catch catfish from dusk until dawn We make our own whiskey and our own smoke, too Ain't too many thing this old boy can't do We grow good old tomato and homemade wine And a country boy can survive, country folk can survive Because you can't starve us out and you can't make us run because we're them old boy raise on shotgun And we say grace and we say Ma'am If you ain't into that we don't give a damn We come from the West Virginia coal mine And the Rocky mountain and the western sky And we can skin a buck, we can run a trotline And a country boy can survive, country folk can survive I have a good friend in New York City He never call me by my name, just \" hillbilly \" My grandpa teach me how to live off the land And his teach him to be a businessman He use to send me picture of the Broadway night And I'd send him some homemade wine But he be kill by a man with a switchblade knife For 43 dollar my friend lose his life I'd love to spit some Beech - Nut in that dude's eye, and shoot him with my old 45 because a country boy can survive, country folk can survive because you can't starve us out and you can't make us run because we're them old boy raise on shotgun And we say grace and we say Ma'am And if you ain't into that we don't give a damn We're from north California and south Alabama And little town all around this land And we can skin a buck, and run a trotline And a country boy can survive, country folk can survive A country boy can survive Country folk can survive"

Here is something very interesting:


microbenchmark::microbenchmark(stem_strings(cleanLyrics), 
                               lemmatize_strings(cleanLyrics))

Unit: milliseconds
                           expr      min       lq     mean   median
      stem_strings(cleanLyrics) 1.073419 1.094751 1.129930 1.117570
 lemmatize_strings(cleanLyrics) 2.519763 2.758884 3.096329 2.863662
       uq      max neval cld
 1.147544 1.405839   100  a 
 3.019610 9.916212   100   b

What is the point here? This song has a little over 400 words in it. Stemming, over 100 runs, took about 1.1 milliseconds on average, while lemmatizing took about 3.1. We are just talking milliseconds, so this is almost to the point where we would not notice; but, if we did this over an entire corpus, we would definitely notice the time.

The question, then, is what you decide to do. For my money, lemmatization does a better job of getting words down to their actual meaning.

Stop Words

Some words do us very little good: articles, prepositions, and other very high-frequency words. These are all words that need to be removed. Fortunately, you don’t have to do this on your own – a great many dictionaries exist that contain words ready for removal.


tm::stopwords("en")

  [1] "i"          "me"         "my"         "myself"     "we"        
  [6] "our"        "ours"       "ourselves"  "you"        "your"      
 [11] "yours"      "yourself"   "yourselves" "he"         "him"       
 [16] "his"        "himself"    "she"        "her"        "hers"      
 [21] "herself"    "it"         "its"        "itself"     "they"      
 [26] "them"       "their"      "theirs"     "themselves" "what"      
 [31] "which"      "who"        "whom"       "this"       "that"      
 [36] "these"      "those"      "am"         "is"         "are"       
 [41] "was"        "were"       "be"         "been"       "being"     
 [46] "have"       "has"        "had"        "having"     "do"        
 [51] "does"       "did"        "doing"      "would"      "should"    
 [56] "could"      "ought"      "i'm"        "you're"     "he's"      
 [61] "she's"      "it's"       "we're"      "they're"    "i've"      
 [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     
 [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
 [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
 [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
 [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
 [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     
 [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"    
[101] "who's"      "what's"     "here's"     "there's"    "when's"    
[106] "where's"    "why's"      "how's"      "a"          "an"        
[111] "the"        "and"        "but"        "if"         "or"        
[116] "because"    "as"         "until"      "while"      "of"        
[121] "at"         "by"         "for"        "with"       "about"     
[126] "against"    "between"    "into"       "through"    "during"    
[131] "before"     "after"      "above"      "below"      "to"        
[136] "from"       "up"         "down"       "in"         "out"       
[141] "on"         "off"        "over"       "under"      "again"     
[146] "further"    "then"       "once"       "here"       "there"     
[151] "when"       "where"      "why"        "how"        "all"       
[156] "any"        "both"       "each"       "few"        "more"      
[161] "most"       "other"      "some"       "such"       "no"        
[166] "nor"        "not"        "only"       "own"        "same"      
[171] "so"         "than"       "too"        "very"      

Removing stopwords takes little effort!


documents = c("I like to chew.", 
              "I am stompin and chewing.", 
              "Chewing is in my blood.", 
              "I am a chewer")

tm::removeWords(documents, words = stopwords("en"))

[1] "I like  chew."        "I  stompin  chewing."
[3] "Chewing    blood."    "I   chewer"          

We can even include custom stopwords:


tm::removeWords(documents, words = c("blood", stopwords("en")))

[1] "I like  chew."        "I  stompin  chewing."
[3] "Chewing    ."         "I   chewer"          

There are many different stopword lists out there, so you might want to poke around just a little bit to find something that will suit the needs of a particular project.


library(stopwords)
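The stopwords package, for example, bundles several competing lists under one function; swapping the source argument swaps the list (a quick comparison, assuming the package is installed):

```r
library(stopwords)

# Different sources yield very different list sizes
length(stopwords("en", source = "snowball"))       # essentially the list tm uses
length(stopwords("en", source = "smart"))          # the SMART information-retrieval list
length(stopwords("en", source = "stopwords-iso"))  # a much larger list

# A few words the ISO list removes that Snowball keeps
head(setdiff(stopwords("en", source = "stopwords-iso"),
             stopwords("en", source = "snowball")))
```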

Text Processing Tools

There are several R packages that will help us process text. The tm package is popular and automates most of our work. You already saw how we use the stemming and stopword removal functions, but tm is full of fun stuff and allows for one-pass text processing.


documents = c("I like to chew.", 
              "I am stompin and chewing.", 
              "Chewing is in my blood.", 
              "I am a chewer")

documentCorp = SimpleCorpus(VectorSource(documents))

stopWordRemoval = function(x) {
  removeWords(x, stopwords("en"))
}

textPrepFunctions = list(tolower,
                         removePunctuation,
                         lemmatize_strings,
                         stopWordRemoval,
                         removeNumbers,
                         stripWhitespace)

documentCorp = tm_map(documentCorp, FUN = tm_reduce, tmFuns = textPrepFunctions)

documentCorp[1][[1]]$content

Once you get your text tidied up (or even before), you can produce some visualizations!


library(tidytext)

library(wordcloud2)

allLyricsDF %>%
  filter(warningIndicator == 0) %>% 
  dplyr::select(lyrics, returnedArtistName) %>%
  mutate(lyrics = as.character(lyrics), 
         lyrics = str_replace_all(lyrics, "\n", " "),                     # newlines to spaces
         lyrics = str_replace_all(lyrics, "\\[[A-Za-z]+\\s*[0-9]*]", ""), # drop [Verse 1]-style tags
         lyrics = str_squish(lyrics),                                     # collapse repeated whitespace
         lyrics = gsub("([a-z])([A-Z])", "\\1 \\2", lyrics)) %>%          # split fused words
  unnest_tokens(word, lyrics) %>%  # one row per word
  anti_join(stop_words) %>%        # remove stop words
  count(word, sort = TRUE) %>% 
  filter(n > 25) %>%               # keep only reasonably frequent words
  na.omit() %>% 
  wordcloud2(shape = "cardioid")

Sentiment Analysis

Sentiment analysis is commonly used when we want to know the general feeling of what someone has written or said. It is most often seen applied to Twitter and other social media posts, but we can use it anywhere people have written or said something (product reviews, song lyrics, final statements).

Sentiment can take many different forms: positive/negative affect, emotional states, and even financial contexts.
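tidytext’s get_sentiments() exposes several of these lexicon flavors (note that “afinn”, “nrc”, and “loughran” are fetched through the textdata package, which asks you to approve the download the first time):

```r
library(tidytext)

get_sentiments("bing")      # positive/negative labels
get_sentiments("afinn")     # integer scores from -5 to +5
get_sentiments("nrc")       # emotional states (joy, anger, fear, ...)
get_sentiments("loughran")  # finance-oriented categories (litigious, ...)
```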

Let’s take a peek at some simple sentiment analysis.

Simple Sentiment

Let’s consider the following statement:


library(tidytext)

statement = "I dislike beer, but I really love the shine."

tokens = tibble(text = statement) %>%   # data_frame() is deprecated; tibble() is current
  unnest_tokens(tbl = ., output = word, input = text)

tokens

# A tibble: 9 x 1
  word   
  <chr>  
1 i      
2 dislike
3 beer   
4 but    
5 i      
6 really 
7 love   
8 the    
9 shine  

Now, we can compare the tokens within our statement to some pre-defined dictionary of positive and negative words.


library(tidyr)

tokens %>%
  inner_join(get_sentiments("bing")) %>% 
  count(sentiment) %>% 
  spread(sentiment, n, fill = 0) %>% 
  mutate(sentiment = positive - negative)

# A tibble: 1 x 3
  negative positive sentiment
     <dbl>    <dbl>     <dbl>
1        1        2         1

When we use the Bing lexicon, we get two positive words (“love” and “shine”) and one negative word (“dislike”), for an overall sentiment of 1 – slightly positive. (A sentiment of 0 would indicate neutrality, while anything above 0 has an increasing amount of positivity and anything below 0 has an increasing amount of negativity.)

Do you think that “dislike” and “love” are of the same magnitude? If I had to make a wild guess, I might say that “love” is stronger than “dislike”. Let’s switch out our sentiment lexicon to get something with a little better notion of polarity magnitude.


tokens %>%
  inner_join(get_sentiments("afinn"))

# A tibble: 2 x 2
  word    score
  <chr>   <int>
1 dislike    -2
2 love        3

Now this looks a bit more interesting! “Love” has a stronger positive polarity than “dislike” has negative polarity. So, we could guess that we would have some positive sentiment.

If we divide the sum of our word sentiments by the number of words that matched the dictionary, we get an idea of our sentence’s overall sentiment.


tokens %>%
  inner_join(get_sentiments("afinn")) %>% 
  summarize(n = nrow(.), sentSum = sum(score)) %>% 
  mutate(sentiment = sentSum / n)

# A tibble: 1 x 3
      n sentSum sentiment
  <int>   <int>     <dbl>
1     2       1       0.5

Our sentiment of .5 tells us that our sentence is positive, even if only slightly so.

While these simple sentiment analyses provide some decent measures of the sentiment of our text, we are ignoring big chunks of the text by just counting keywords.

For example, it is probably fair to say that “really love” is stronger than just “love”. We might want to switch over to some techniques that consider n-grams and other text features to calculate sentiment.

Smarter Sentiment Analysis

When we use sentiment analysis that is aware of context, valence (“love” is stronger than “like”), modifiers (e.g., “really love”), and adversative statements (“but,…”, “however,…”), we get a better idea about the real sentiment of the text.
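Before handing this off to a real package, the core idea can be sketched in a few lines of base R: boost a polarized word when an amplifier precedes it, and down-weight the clause before an adversative “but”. This is a toy illustration with made-up weights, not sentimentr’s actual algorithm:

```r
polarity  = c(dislike = -1, love = 0.75)  # hypothetical polarity weights
amplifier = c(really = 0.8)               # boosts the word that follows it

scoreClause = function(words) {
  score = 0
  for (i in seq_along(words)) {
    if (words[i] %in% names(polarity)) {
      boosted = i > 1 && words[i - 1] %in% names(amplifier)
      weight = if (boosted) 1 + amplifier[[words[i - 1]]] else 1
      score = score + polarity[[words[i]]] * weight
    }
  }
  score
}

tokens = c("i", "dislike", "beer", "but", "i", "really", "love", "the", "shine")
butAt = match("but", tokens)

# Halve the clause before "but"; keep the clause after it at full weight
0.5 * scoreClause(tokens[1:(butAt - 1)]) +
  scoreClause(tokens[(butAt + 1):length(tokens)])
# 0.5 * (-1) + 0.75 * (1 + 0.8) = 0.85
```

Even this crude version captures the intuition: the amplified “love” after the adversative dominates the sentence’s score.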

We will use the Jockers sentiment lexicon, but many more are available. Depending on your exact needs, there are some dictionaries designed for different applications.

Before we engage in our whole sentiment analysis, let’s take a look at a few things.

Here is the Jockers dictionary that sentimentr will use.


lexicon::hash_sentiment_jockers

                 x     y
    1:     abandon -0.75
    2:   abandoned -0.50
    3:   abandoner -0.25
    4: abandonment -0.25
    5:    abandons -1.00
   ---                  
10734:     zealous  0.40
10735:      zenith  0.40
10736:        zest  0.50
10737:      zombie -0.25
10738:     zombies -0.25

You might want to use View() to get a complete look at what is happening in there.

We should also take a peek at our valence shifters (the y column codes the shifter type: 1 = negator, 2 = amplifier, 3 = de-amplifier, 4 = adversative conjunction):


lexicon::hash_valence_shifters

              x y
  1: absolutely 2
  2:      acute 2
  3:    acutely 2
  4:      ain't 1
  5:       aint 1
 ---             
136:    whereas 4
137:      won't 1
138:       wont 1
139:   wouldn't 1
140:    wouldnt 1

With all of that out of the way, let’s get down to the matter at hand:


library(sentimentr)

library(lexicon)

library(magrittr)

statement = "I dislike beer, but I really love the shine."

sentiment(statement, polarity_dt = lexicon::hash_sentiment_jockers)

   element_id sentence_id word_count sentiment
1:          1           1          9    0.9375

We can see that we get a much stronger sentiment score when we include more information within the sentence. While the first part of our sentence starts out with a negative word (“dislike” has a sentiment value of -1), we have an adversative “but” that down-weights whatever is in the initial phrase, followed by the amplified (from “really”) sentiment of “love” (with a weight of .75 in our dictionary).

With all of this together, we get a much better idea about the sentiment of our text.

Back To The Music

While the text that we have used so far serves its purpose as an example quite well, we can always take a look at other written words.


load(url("https://raw.githubusercontent.com/saberry/courses/master/hash_sentiment_vadar.RData"))

cleanLyrics = allLyricsDF %>%
  filter(warningIndicator == 0) %>% 
  dplyr::select(lyrics, returnedArtistName, returnedSong) %>%
  mutate(lyrics = as.character(lyrics), 
         lyrics = str_replace_all(lyrics, "\n", " "),   
         lyrics = str_replace_all(lyrics, "(\\[.*?\\])", ""), # look different?
         lyrics = str_squish(lyrics), 
         lyrics = gsub("([a-z])([A-Z])", "\\1 \\2", lyrics))

songSentiment = sentiment(get_sentences(cleanLyrics), 
          polarity_dt = hash_sentiment_vadar) %>% 
  group_by(returnedSong) %>% 
  summarize(meanSentiment = mean(sentiment))

Naturally, we would want to join those sentiment values up with our original data:


cleanLyrics = left_join(cleanLyrics, songSentiment, by = "returnedSong")

From here, we have several choices in front of us. One, we could use those sentiment values within a model (e.g., we might want to predict charting position). Or, we could use them for some further exploration:


library(DT)

sentimentBreaks = c(-1.7, -.5, 0, .5, 1.7)

breakColors = c('rgb(178,24,43)','rgb(239,138,98)','rgb(253,219,199)','rgb(209,229,240)','rgb(103,169,207)','rgb(33,102,172)')

datatable(cleanLyrics, rownames = FALSE, 
              options = list(pageLength = 15, escape = FALSE, 
                             columnDefs = list(list(targets = 1, visible = FALSE)))) %>% 
  formatStyle("lyrics", "meanSentiment", backgroundColor = styleInterval(sentimentBreaks, breakColors))
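Circling back to the modeling option mentioned above: with a chart-position column joined in, the sentiment score could feed a regression. Since chart position is not in cleanLyrics as it stands, here is a self-contained sketch on simulated data (peakPosition is a hypothetical column name, not one in our data):

```r
set.seed(1001)

# Simulated stand-in: 100 songs with sentiment scores and chart peaks,
# built so that happier lyrics get lower (better) chart peaks
toySongs = data.frame(meanSentiment = runif(100, min = -0.5, max = 0.5))
toySongs$peakPosition = 25 - 10 * toySongs$meanSentiment + rnorm(100, sd = 5)

chartModel = lm(peakPosition ~ meanSentiment, data = toySongs)
coef(chartModel)  # the slope should come back clearly negative
```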

We can also do some checking over time:


library(ggplot2)

load("C:/Users/sberry5/Documents/teaching/courses/unstructured/data/countryTop50.RData")

allTop50 = allTop50 %>% 
  group_by(song) %>% 
  slice(1)

cleanLyrics = left_join(cleanLyrics, allTop50, by = c("returnedSong" = "song"))

cleanLyrics %>% 
  group_by(date) %>% 
  na.omit() %>% 
  summarize(meanSentiment = mean(meanSentiment)) %>% 
  ggplot(., aes(date, meanSentiment)) + 
  geom_point() +
  theme_minimal()

That is pretty messy (but I am curious about that really happy month), so let’s try something else:


library(gganimate)

cleanLyrics %>% 
  mutate(year = lubridate::year(date), 
         month = lubridate::month(date)) %>% 
  group_by(year, month, date) %>% 
  na.omit() %>% 
  summarize(meanSentiment = mean(meanSentiment)) %>% 
  ggplot(., aes(as.factor(month), meanSentiment, color = meanSentiment)) + 
  geom_point() +
  scale_color_distiller(type = "div") +
  theme_minimal() +
  transition_states(year,
                    transition_length = length(1975:2018),
                    state_length = 3) +
  ggtitle('Year: {closest_state}')


cleanLyrics %>% 
  mutate(year = lubridate::year(date)) %>% 
  group_by(year) %>% 
  na.omit() %>% 
  summarize(meanSentiment = mean(meanSentiment)) %>% 
  ggplot(., aes(year, meanSentiment, color = meanSentiment)) + 
  geom_point() +
  theme_minimal()

Other Text Fun

Sentiment analysis is always a handy tool to have around. You might also want to explore other descriptive aspects of your text.

The koRpus package allows for all types of interesting descriptives. There are a great number of readability and lexical diversity statistics (Fucks is likely my favorite).

We need to tokenize our text in a manner that will please koRpus.
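A minimal sketch of what that tokenization looks like, assuming koRpus and its separate English language package, koRpus.lang.en, are both installed:

```r
library(koRpus)
library(koRpus.lang.en)  # language support ships separately from koRpus

# format = "obj" lets tokenize() accept a character vector instead of a file path
lyricTokens = tokenize("A country boy can survive.", format = "obj", lang = "en")

# With tokenized text in hand, readability measures are one-liners
flesch(lyricTokens)
```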